WO 03/085585



PCT/FI03/00248 






Claims 

1. A method for gene mapping from genotype and phenotype data, which method
utilizes linkage disequilibrium between genetic markers m_i, which are polymorphic
nucleic acid or protein sequences or strings of single-nucleotide polymorphisms
deriving from a chromosomal region, characterized in that

i) all marker patterns P that satisfy a pattern evaluation function e(P) are
searched from the data, wherein
a. the marker patterns are expressions involving the marker-allele assignments
and zero or more of the following: individual covariates, environmental
variables and auxiliary phenotypes; and
b. the pattern evaluation function e(P) involves some statistical measure of
the association between the marker pattern P and the phenotype being
studied,
by testing each marker of pattern P against the corresponding allele pair in
genotype G, effectively finding out if there is a possible haplotype configuration
of G which matches P and counting the possible matches as matches,

ii) each marker m_i of the data is scored by a marker score s(m_i), which is a
function of the set S_i defined as the set of marker patterns overlapping the marker
m_i and satisfying the pattern evaluation function e as defined in step (i), and

iii) the location of the gene is predicted as a function of the scores s(m_i) of the
markers in the data and is based on maximizing the score if the scoring
function is designed to give higher scores closer to the gene, and on minimizing
the score if the scoring function is designed to give lower scores closer to
the gene, as is the case for instance when the scores s(m_i) are marker-wise p
values.
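
The matching test of step (i) can be sketched in Python as follows. The encoding is an assumption made only for illustration: a pattern is a list of alleles with '*' for an unassigned marker, and an unphased genotype is a list of allele pairs. The function name is hypothetical.

```python
def genotype_matches(pattern, genotype):
    """Return True if some haplotype configuration of the genotype matches the pattern.

    pattern  -- list of alleles, '*' meaning "any allele" (assumed encoding)
    genotype -- list of (allele, allele) pairs, one unordered pair per marker
    """
    for allele, pair in zip(pattern, genotype):
        # An assigned marker must be present in the allele pair. Because the
        # phase is unknown, any per-marker choice is a possible haplotype.
        if allele != '*' and allele not in pair:
            return False
    return True
```

A possible match is counted as a match, so a heterozygous genotype can match several mutually exclusive patterns.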

2. A method according to claim 1, characterized in that a marker is scored as
the sum of the weights of overlapping patterns.
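
A minimal Python sketch of this scoring follows. It assumes that a pattern is a list with '*' for unassigned markers and that a pattern "overlaps" every marker between its first and last assigned marker, gaps included. The function name and the default weight of 1 are hypothetical.

```python
def marker_scores(num_markers, patterns, weight=lambda p: 1.0):
    """Score each marker as the sum of the weights of the patterns overlapping it."""
    scores = [0.0] * num_markers
    for p in patterns:
        assigned = [i for i, allele in enumerate(p) if allele != '*']
        if not assigned:
            continue  # a pattern with no assigned marker overlaps nothing
        # Assumed overlap rule: the span from first to last assigned marker.
        for i in range(assigned[0], assigned[-1] + 1):
            scores[i] += weight(p)
    return scores
```

With the default weight, the score reduces to the number of overlapping patterns.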

3. A method according to claim 2, characterized in that the weight of a pattern is
a function of

- the uncertainty of matching, e.g. 2^(1-N[i]), where N[i] is the number of
heterozygous markers within the pattern in genotype i, summed over all
matched genotypes, or

- the informativeness of the pattern, e.g. 2^H, where H is the average
heterozygosity within the pattern, or

- the strength of association, e.g. chi-squared.

4. A method of claim 1, characterized in that the marker patterns P are searched 
for by the following algorithm: 

Input

• set U of possible marker patterns

• evaluation function e(P) for patterns P in U

• (generalization) relation < for patterns in U,
  where the function e and the relation < are such that if e(P) is true and P' < P,
  then e(P') is also true

Output

• set S = { P in U | e(P) is true } of patterns

Method

1. S := { }
2. // Initialize the set of evaluated patterns:
3. E := { }
4. // Start with the most general patterns:
5. Gen := { P in U | there is no P' in U, P' != P, such that P' < P }
6. // Recursively evaluate patterns in a depth-first order:
7. foreach P in Gen { evaluatePatterns(P) }
8. end;

9. procedure evaluatePatterns(P) {
10.   insert P into the set E
11.   if e(P) = true then {
12.     insert P into set S
13.     // Find all specializations of P that have not been tested yet, and
14.     // evaluate them recursively:
15.     Spec := { P' in U-E | P < P', P' != P, and there is no P" in U-E, P" != P
16.       and P" != P', with P < P" < P' };
17.     foreach P' in Spec { evaluatePatterns(P'); }
18.   }
19. }
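
The depth-first search above can be sketched in Python under stated assumptions. The universe is an explicit finite set of hashable patterns, `more_general(a, b)` is a strict test for a < b, and `e` is monotone as required in the Input section. All names are hypothetical. The sketch skips a pattern already evaluated through another branch, a small deviation from the listing.

```python
from itertools import combinations


def search_patterns(universe, e, more_general):
    """Depth-first search for all patterns in `universe` that satisfy e."""
    selected, evaluated = set(), set()

    def evaluate(p):
        evaluated.add(p)
        if not e(p):
            return  # by monotonicity no specialization of p can satisfy e
        selected.add(p)
        rest = universe - evaluated
        # Untested specializations of p with no untested pattern in between.
        spec = [q for q in rest if more_general(p, q) and not any(
            more_general(p, r) and more_general(r, q) for r in rest)]
        for q in spec:
            if q not in evaluated:
                evaluate(q)

    # Most general patterns: nothing in the universe is more general.
    roots = [p for p in universe
             if not any(more_general(q, p) for q in universe)]
    for p in roots:
        evaluate(p)
    return selected


# Toy universe (an assumption): patterns are item sets, a proper subset is more general.
items = 'abc'
universe = {frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)}
found = search_patterns(universe, lambda p: len(p) <= 1, lambda a, b: a < b)
```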

5. A method of claim 1, characterized in that the marker patterns P are searched
for by the following algorithm:

Input

• set U of possible marker patterns

• evaluation function e(P) for patterns P in U

• frequency threshold x

Output

• set S = { P in U | e(P) and ae(P) is true } of patterns, where ae(P) is true if and
  only if the frequency of pattern P exceeds a given threshold x

Method

20. S := { }
21. // Initialize the set of evaluated patterns:
22. E := { }
23. // Start with the most general patterns:
24. Gen := { P in U | there is no P' in U, P' != P, such that P -> P' }
25. // Recursively evaluate patterns in a depth-first order:
26. foreach P in Gen { evaluatePatterns(P) }
27. end

28. procedure evaluatePatterns(P) {
29.   insert P into the set E
30.   if ae(P) = true then {
31.     if e(P) = true then insert P into set S
32.     // Find all specializations of P that have not been tested yet, and evaluate
33.     // them recursively:
34.     Spec := { P' in U-E | P' -> P, P' != P, and there is no P" in U-E, P" != P
35.       and P" != P', with P' -> P" and P" -> P }
36.     foreach P' in Spec { evaluatePatterns(P') }
37.   }
38. }

6. A method of claim 1, characterized in that the marker patterns P are searched
for by the following algorithm:

Input

• marker map M = (m_1, ..., m_k)

• phenotype vector Y = (Y_1, ..., Y_n)

• genotype matrix H of size n × k × 2 (n persons, k markers, 2 alleles per person
  and marker)

• association threshold x for chi-squared test

• maximum pattern length l

• maximum number of gaps g

• maximum gap size s

Output

• set S = { P in U | e(P) is true } of patterns, where
  - the set U consists of patterns on M that consist of marker-allele assignments and
    that adhere to parameters l, g, and s, and
  - e(P) is true if and only if chi-squared test on P using genotype matrix H
    and phenotypes Y exceeds the given threshold x

Method

39. S := { }
40. // Number of case and control persons:
41. piA := number of affected persons;
42. piC := number of control persons;
43. pi := piA + piC
44. // A lower bound for pattern frequency:
45. lb := piA * pi * x / (piC * pi + piA * x)
46. // Variable for iterating over different patterns:
47. P = (p_1, ..., p_k) := ('*', ..., '*')
48. for i := 1 to k {
49.   // alleles(m_i) is the set of alleles of the i:th marker
50.   foreach a in alleles(m_i) {
51.     p_i := a
52.     // Test pattern P and all its extensions:
53.     checkPatterns(P, i, i, 0, 0)
54.     // Reset p_i:
55.     p_i := '*'
56.   }
57. }
58. end

59. // Test haplotype pattern P and all patterns that can be generated by extending P
60. // from the right:
61. procedure checkPatterns(P, start, i, nr_of_gaps, gap_length) {
62.   // Output strongly associated patterns
63.   if chi-squared(P, M, H, Y) >= x and p_i != '*' then insert P into set S
64.   // Return if extended patterns would be too long:
65.   if i = k or i+1-start > l then return
66.   // Return if extended patterns can not be strongly disease-associated:
67.   if frequency of P in affected persons is less than lb
68.   then return;
69.   // Create and test legal extensions of current pattern P (3 cases):
70.   // 1. Give marker i+1 all possible values:
71.   foreach a in alleles(m_(i+1)) {
72.     p_(i+1) := a
73.     checkPatterns(P, start, i+1, nr_of_gaps, 0)
74.   }
75.   // 2. Introduce a new gap starting at marker i+1:
76.   if p_i != '*' and nr_of_gaps < g and s > 0 then {
77.     p_(i+1) := '*'
78.     checkPatterns(P, start, i+1, nr_of_gaps+1, 1)
79.   }
80.   // 3. Extend the current gap over marker i+1:
81.   if p_i = '*' and gap_length < s then {
82.     p_(i+1) := '*'
83.     checkPatterns(P, start, i+1, nr_of_gaps, gap_length+1)
84.   }
85.   // Before returning, reset p_(i+1):
86.   p_(i+1) := '*'
87.   return
88. }
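
The role of the lower bound lb in step 45 can be checked numerically. The Python sketch below assumes that lb has the form piA * pi * x / (piC * pi + piA * x) and that the test is a plain Pearson chi-squared on a 2 × 2 table without continuity correction. Both function names are hypothetical. In the most favourable case, where a pattern occurs only in affected persons, the statistic reaches x exactly at frequency lb, so a pattern below lb cannot be extended into a strongly associated one.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def frequency_lower_bound(pi_a, pi_c, x):
    """Smallest frequency among affected persons that can still reach threshold x."""
    pi = pi_a + pi_c
    return pi_a * pi * x / (pi_c * pi + pi_a * x)


pi_a, pi_c, x = 100, 100, 9.0
lb = frequency_lower_bound(pi_a, pi_c, x)
# Best case: the pattern occurs in lb affected persons and in no control person.
best_case = chi_squared_2x2(lb, pi_a - lb, 0, pi_c)
```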

7. A method of claim 1, characterized in that the marker patterns P are searched 
for by the following algorithm: 
Input

• set U of possible marker patterns

• evaluation function e(P) for patterns P in U

• (generalization) relation < for patterns in U, where the function e and the
  relation < are such that if e(P) is true and P' < P, then e(P') is also true

Output

• set S = { P in U | e(P) is true } of patterns

Definitions

• function Lgg: U -> 2^U, Lgg(P) = { P' in U | P > P' and P' != P and there is no
  P" in U such that P != P" != P' and P > P" > P' }, the set of least general
  generalizations of pattern P.

• function Lss: U -> 2^U, Lss(P) = { P' in U | P < P' and P' != P and there is no
  P" in U such that P != P" != P' and P < P" < P' }, the set of least special
  specializations of pattern P.

Method

89. S := { }
90. Q := { }
91. // Start with the most general patterns:
92. F := { P in U | there is no P' in U, P' != P, such that P' < P };
93. while F != { } {
94.   // Evaluate the candidate patterns:
95.   foreach P in F {
96.     if e(P) = true then insert P into set S
97.     else remove P from set F
98.   }
99.   Q := Q union F
100.   // Generate a new set of candidate patterns:
101.   C := { }
102.   foreach P in F {
103.     C := C union { P' in U | P' in Lss(P) and for all P" in Lgg(P'):
104.       P" in Q }
105.   }
106.   F := C
107. }
108. end
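
The level-wise search above can be sketched in Python for one concrete pattern language. It assumes, for illustration only, that patterns are sets of items and that a proper subset is the more general pattern. Lss(P) then adds one item to P, and Lgg(P') removes one item from P'. The function name is hypothetical.

```python
def levelwise_search(items, e):
    """Breadth-first search for all item sets that satisfy a monotone predicate e."""
    selected, qualified = set(), set()
    frontier = {frozenset()}  # the single most general pattern
    while frontier:
        # Evaluate the candidates and keep only those that satisfy e.
        frontier = {p for p in frontier if e(p)}
        selected |= frontier
        qualified |= frontier
        # A new candidate is a least special specialization of a kept pattern
        # whose least general generalizations have all qualified.
        candidates = set()
        for p in frontier:
            for item in items - p:
                q = p | {item}
                if all(q - {i} in qualified for i in q):
                    candidates.add(q)
        frontier = candidates
    return selected


found = levelwise_search({'a', 'b', 'c'}, lambda p: p <= {'a', 'b'})
```

Unlike the depth-first variant, a candidate is evaluated only after every one of its immediate generalizations has qualified, which avoids evaluating patterns that cannot satisfy e.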

8. A method of claim 1, characterized in that the marker patterns P are searched
for by the following algorithm:

Input

• set U of possible marker patterns

• evaluation function e(P) for patterns P in U

• frequency threshold x

Output

• set S = { P in U | e(P) and ae(P) is true } of patterns, where ae(P) is true if and
  only if the frequency of pattern P exceeds a given threshold x

Definitions

• function Lgg: U -> 2^U, Lgg(P) = { P' in U | P -> P' and P' != P and there is no
  P" in U such that P != P" != P' and P -> P" -> P' }, the set of least general
  generalizations of pattern P.

• function Lss: U -> 2^U, Lss(P) = { P' in U | P' -> P and P' != P and there is no
  P" in U such that P != P" != P' and P' -> P" -> P }, the set of least special
  specializations of pattern P.

Method

109. S := { }
110. Q := { }
111. // Start with the most general patterns:
112. F := { P in U | there is no P' in U, P' != P, such that P -> P' };
113. while F != { } {
114.   // Evaluate the candidate patterns:
115.   foreach P in F {
116.     if ae(P) = true then {
117.       if e(P) = true then insert P into set S
118.     }
119.     else remove P from set F
120.   }
121.   Q := Q union F
122.   // Generate a new set of candidate patterns:
123.   C := { }
124.   foreach P in F {
125.     C := C union { P' in U | P' in Lss(P) and for all P" in Lgg(P'):
126.       P" in Q }
127.   }
128.   F := C
129. }
130. end

9. A method of claim 1, characterized in that

a) the phenotype being studied is qualitative, and

b) the pattern evaluation function e(P) has the form e(P) = true if and only if
e'(P) > x, where e'(P) is the (signed) association measure chi-squared and x is a
user-specified minimum value, which is chosen so that the sizes of S_i are large
enough, such as 7, to give statistically sufficiently reliable estimates for the
gene locus, and

c) the score s(m_i) of marker m_i is the size of S_i, also called marker-wise
pattern frequency of m_i and denoted by f(m_i).

10. A method of claim 1, characterized in that

a) the pattern evaluation function e(P) has the form e(P) = true if and only if
e'(P) > x, where e'(P) is the absolute frequency of pattern P in the data and
x is a user-specified value, which is chosen so that the sizes of S_i are large
enough, such as 20, to give statistically sufficiently reliable estimates for
the gene locus, and

b) in order to derive the score s(m_i), the p value (statistical significance) of
each marker pattern P in determining the phenotype being studied is evaluated,
and

c) the score s(m_i) is the distance between the observed p value distribution of
patterns in S_i and the uniform distribution, defined as the average of
(p_i - q_i) log (p_i / q_i) over all i = 1..n, where n is the number of haplotype
patterns in S_i, p_i is the i:th smallest p value in S_i and q_i is the expectation
of the i:th smallest p value, if the p values were randomly drawn from the uniform
distribution.
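
The distance of item c) can be sketched in Python. The sketch relies on the standard fact that the expected i:th smallest of n independent uniform(0, 1) draws is i / (n + 1), and it assumes that all p values are strictly positive. The function name is hypothetical.

```python
import math


def pvalue_distribution_score(p_values):
    """Average of (p_i - q_i) * log(p_i / q_i) over the sorted p values."""
    ps = sorted(p_values)
    n = len(ps)
    total = 0.0
    for rank, p in enumerate(ps, start=1):
        q = rank / (n + 1)  # expected rank:th smallest uniform p value
        total += (p - q) * math.log(p / q)
    return total / n
```

Each term is non-negative, so the score is zero only when every observed p value equals its expectation and grows as the observed distribution departs from the uniform one.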

11. A method of claim 10, characterized in that the p value is computed using a
linear model of form Y = αZ + β_1 X_1 + ... + β_k X_k + β_0, where the dependent
variable Y is the phenotype being studied, X_1 through X_k are covariates, such as
environmental factors, and Z is a dummy variable for the occurrence of the
haplotype pattern, and

the coefficients α and β_i are adjusted for best fit, and then

the significance of Z as a covariate is assessed by using a t test with the null
hypothesis "α = 0".

12. A method of claim 1, characterized in that each score s(m_i) is refined by
replacing it by the marker-wise p value of the score s(m_i), where the statistical
significance of s(m_i) is measured against the null hypothesis that there is no gene
effect.

13. A method of claim 12, characterized in that the marker-wise p values p(m_i)
are determined by randomly permuting phenotypes.
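
A permutation test of this kind can be sketched in Python as follows. The callable `score_markers`, which maps a phenotype vector to a list of marker scores, is hypothetical, as are the parameter names. The "+ 1" in the estimate is a common convention and an assumption here, not something stated in the claim.

```python
import random


def permutation_p_values(phenotypes, score_markers, n_perm=1000,
                         higher_is_better=True, seed=0):
    """Estimate marker-wise p values by randomly permuting the phenotypes."""
    rng = random.Random(seed)
    observed = score_markers(phenotypes)
    extreme = [0] * len(observed)
    shuffled = list(phenotypes)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype association
        permuted = score_markers(shuffled)
        for i, (obs, perm) in enumerate(zip(observed, permuted)):
            hit = perm >= obs if higher_is_better else perm <= obs
            if hit:
                extreme[i] += 1
    # Assumed convention: count the observed labelling as one permutation.
    return [(count + 1) / (n_perm + 1) for count in extreme]


# Toy scorer (an assumption): marker 0 counts affected persons among the first
# five, marker 1 is constant and therefore never significant.
p_values = permutation_p_values([1] * 5 + [0] * 5,
                                lambda y: [sum(y[:5]), 0])
```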

14. A method of claim 1, characterized in that the area returned from the
prediction of the gene location is contiguous or fragmented or a point.

15. A method of claim 1, characterized in that the location of the gene, predicted
as a function of the scores s(m_i) and based on maximizing or minimizing the score,
is predicted to the location of the marker m_i that maximizes or minimizes the
marker score s(m_i).

16. A method of claim 1, characterized in that the location of the gene, predicted
as a function of the scores s(m_i) and based on maximizing or minimizing the score,
is predicted to the combination of most probable intervals for containing the
trait-susceptibility locus that covers at most the desired proportion t (t ∈ [0,100%])
of the original region obtained by taking all such points in the studied chromosomal
region whose nearest marker is within the k best scoring markers, where k is selected
such that the resulting area has length at most t times the length of the studied
region, and where k is maximal such value.
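
The interval selection can be sketched in Python under stated assumptions: marker positions are sorted in ascending order, the studied region runs from the first to the last marker, and a point belongs to a marker when that marker is its nearest one. The function name and the return format are hypothetical.

```python
def predict_region(positions, scores, t, higher_is_better=True):
    """Return the intervals nearest to the k best markers, with k maximal."""
    n = len(positions)
    lo, hi = positions[0], positions[-1]
    # Midpoints between neighbouring markers bound each marker's own interval.
    bounds = [lo] + [(positions[i] + positions[i + 1]) / 2
                     for i in range(n - 1)] + [hi]
    cells = [(bounds[i], bounds[i + 1]) for i in range(n)]
    order = sorted(range(n), key=lambda i: scores[i], reverse=higher_is_better)
    budget = t * (hi - lo)
    chosen, used = [], 0.0
    for i in order:
        length = cells[i][1] - cells[i][0]
        if used + length > budget:
            break  # taking one more marker would exceed the proportion t
        chosen.append(i)
        used += length
    return sorted(cells[i] for i in chosen)


region = predict_region([0, 10, 20, 30], [1, 5, 4, 0], t=0.6)
```

The returned intervals may be adjacent or separated, so the predicted area can be contiguous or fragmented.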

17. A method of claim 1, characterized in that the location of the gene, predicted
as a function of the scores s(m_i) and based on maximizing or minimizing the score,
is predicted to those points in the studied chromosomal region whose nearest
marker scores at least y or at most y, where y is scoring function dependent and is
selected so that the probability of the gene being close to the marker is sufficiently
large.

18. A method of claim 1, characterized in that the location of the gene, predicted
as a function of the scores s(m_i) and based on maximizing or minimizing the score,
is determined by expert investigation of the marker scores or their visualization.

19. A method of claim 1, characterized in that several genes are searched for
simultaneously by using marker patterns that refer to several potential gene loci at
the same time.



20. A computer-readable data storage medium having computer-executable program
code stored, characterized in that it is operative to perform a method of any
of the preceding claims when executed on a computer.

21. A computer system, characterized in that it is programmed to perform the 
method of any of the claims 1 to 19. 



